As you may know from reading the Perl Foundation blog, I was awarded a grant to work on a MediaWiki parser. I actually knew about the grant long before the announcement, because quite some time passed between the committee's decision and the post on the foundation's blog.
I've started on some working code. However, I've made only slow progress, for several reasons:
Recently, I've been relatively lethargic. Being out of a job and lacking motivation, I don't seem to have the will to get things done. Most of the time I just rest, play games, read email and RSS feeds, and so on, but don't really code.
The prospect of getting the money in return is not enough of a motivation to work on the parser.
This is an annoying task. So far, the code I've written handles only a small subset of the syntax, but it is already very complicated, monolithic, and "ugly". The MediaWiki syntax is highly irregular, and I find that handling all the edge cases while outputting a well-formed stream of tokens is hard.
It's complicated. Like I said, the syntax is highly irregular, which makes this a hard task. So I may feel intimidated by it, and as a result even less motivated.
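To give a concrete feel for the irregularity: MediaWiki marks italics with two apostrophes and bold with three, so a run of five opens both at once, and runs of other lengths have to be split up heuristically. Below is a minimal sketch of tokenizing just these quote runs — written in Python for illustration rather than Perl, and not the actual grant code; the function name and token shapes are made up:

```python
import re

# Hypothetical sketch: tokenize MediaWiki apostrophe runs.
# ''  -> italics toggle, ''' -> bold toggle, ''''' -> both at once.
# Runs of other lengths (e.g. four apostrophes) are left as plain
# text here; the real parser must split them heuristically.
def tokenize_quotes(text):
    tokens = []
    # Match the longest apostrophe run, or a stretch of non-apostrophe
    # text, or a lone apostrophe.
    for match in re.finditer(r"'{2,}|[^']+|'", text):
        run = match.group()
        if run == "'''''":
            tokens.append(("BOLD_ITALIC",))
        elif run == "'''":
            tokens.append(("BOLD",))
        elif run == "''":
            tokens.append(("ITALIC",))
        else:
            tokens.append(("TEXT", run))
    return tokens
```

Even this toy tokenizer punts on runs of four apostrophes, which might mean a literal apostrophe followed by bold — exactly the kind of edge case that makes the full syntax painful to handle.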
So to sum up - I've neglected working on it. There's still a substantial amount of code I've written, with many extensive tests, but it covers only a very small subset of the syntax. If someone wishes to help with this work, I'll gladly give them commit access to the repository. But I don't feel very motivated to work on it myself.
I've been thinking of doing something to compensate for that. I'd like to help squash Archive::Zip bugs, but I still need repository access. I'd also like to resume work on Test-Run, though I may need to re-implement it more directly on top of TAP::Parser and TAP::Harness. I've also been neglecting work on File::Find::Object, and can resume it. That was an alternative grant proposal I submitted along with the MediaWiki parser one.
I can also help resolve random bugs from rt.cpan.org or, unrelated to Perl, dedicate more time to being a Linux kernel janitor (which I'm also trying to do because I hope it will help me find a job).
In any case, I hope you're not too disappointed by my lack of willingness to work on the MediaWiki parser. I guess you can't always succeed at what you're trying to do.
I'm a massive fan of Wikipedia and other Wikimedia projects like Wikiquote, but I am very concerned that so much valuable data is being created in such a terrible format. I have been looking for other parsers, but so far have only found this Python MediaWiki parser which in any case doesn't separate the parsing and output phases.
Given MediaWiki's horrible syntax where 'normal' wikitext, HTML and CSS can be liberally mixed together, I'm not surprised you had significant problems writing a parser.
Still, I would like a MediaWiki parser, so I'm wondering if it's possible for the Perl Foundation to set up a fund specifically for this project. I would gladly donate to it. Perhaps other people might be more motivated by funding.
As an aside, I'm starting to wonder if Wikimedia content can really be licensed under the GNU FDL, as the license states that a transparent copy of the content must be provided. "Transparent" is defined as "represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors". Clearly it is possible to edit Mediawiki markup with a text editor, but there is no specification for Mediawiki markup. There are instead just several help documents and a basic attempt to create a spec. The FDL also adds "A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent". I wouldn't go quite as far as saying that Mediawiki markup is deliberately obfuscated, as clearly that wouldn't be compatible with the project's aims, but it certainly doesn't lend itself to quick and easy modification.
In the longer term, if Wikipedia is serious about being around in 100 years, I think they really need to produce a proper MediaWiki markup specification, then develop a standalone (read: not hacky PHP) parser against it and use it to convert MediaWiki markup to a better-designed wiki markup language like WikiCreole.
Re:Great work - how can we fund its continuation?
Shlomi Fish on 2007-10-03T08:49:31
Hi!
Thanks for your comprehensive reply.
Thanks for your work so far!
You're welcome, but I don't think I deserve a lot of thanks.
I'm a massive fan of Wikipedia and other Wikimedia projects like Wikiquote, but I am very concerned that so much valuable data is being created in such a terrible format. I have been looking for other parsers, but so far have only found this Python MediaWiki parser which in any case doesn't separate the parsing and output phases.
Well, there's wiki2xml for MediaWiki (possibly based on the MW code), which converts MW markup to XML. Maybe it's what you're looking for.
Given MediaWiki's horrible syntax where 'normal' wikitext, HTML and CSS can be liberally mixed together, I'm not surprised you had significant problems writing a parser.
Still, I would like a MediaWiki parser, so I'm wondering if it's possible for the Perl Foundation to set up a fund specifically for this project. I would gladly donate to it. Perhaps other people might be more motivated by funding.
That may be a good idea.
As an aside, I'm starting to wonder if Wikimedia content can really be licensed under the GNU FDL, as the license states that a transparent copy of the content must be provided. "Transparent" is defined as "represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors". Clearly it is possible to edit Mediawiki markup with a text editor, but there is no specification for Mediawiki markup. There are instead just several help documents and a basic attempt to create a spec. The FDL also adds "A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent". I wouldn't go quite as far as saying that Mediawiki markup is deliberately obfuscated, as clearly that wouldn't be compatible with the project's aims, but it certainly doesn't lend itself to quick and easy modification.
Well, presumably, a human can take a document written in MediaWiki syntax and convert it to something more strict manually, so it may be OK.
In the longer term, if Wikipedia is serious about being around in 100 years, I think they really need to produce a proper MediaWiki markup specification, then develop a standalone (read: not hacky PHP) parser against it and use it to convert MediaWiki markup to a better-designed wiki markup language like WikiCreole.
Re:Parser inspirations...
Shlomi Fish on 2007-10-02T09:09:40
If worst comes to worst, you could always use PPI as a starting point.
Isn't PPI a parser for Perl 5 code? How will this help me parse MediaWiki syntax?